Machine learning identifies prognostic subtypes of the tumor microenvironment of NSCLC

The tumor microenvironment (TME) plays a fundamental role in tumorigenesis, tumor progression, and anti-cancer immunity potential of emerging cancer therapeutics. Understanding inter-patient TME heterogeneity, however, remains a challenge to efficient drug development. This article applies recent advances in machine learning (ML) for survival analysis to a retrospective study of NSCLC patients who received definitive surgical resection and immune pathology following surgery. ML methods are compared for their effectiveness in identifying prognostic subtypes. Six survival models, including Cox regression and five survival machine learning methods, were calibrated and applied to predict survival for NSCLC patients based on PD-L1 expression, CD3 expression, and ten baseline patient characteristics. Prognostic subregions of the biomarker space are delineated for each method using synthetic patient data augmentation and compared between models for overall survival concordance. A total of 423 NSCLC patients (46% female; median age [inter quantile range]: 67 [60–73]) treated with definite surgical resection were included in the study. And 219 (52%) patients experienced events during the observation period consisting of a maximum follow-up of 10 years and median follow up 78 months. The random survival forest (RSF) achieved the highest predictive accuracy, with a C-index of 0.84. The resultant biomarker subtypes demonstrate that patients with high PD-L1 expression combined with low CD3 counts experience higher risk of death within five-years of surgical resection.

It represents the instantaneous rate of failure at time , given that the individual has not experience the event until .
To study the association between the covariates  and the hazard of , a Cox proportional hazard model assumes the noninformative censoring, i.e.  # is independent of  # given  # ,  # ⊥  # | # .And the hazard function is modeled as ℎ(|) = ℎ E ()  N O P , where ℎ E () is the baseline hazard function, and  is the coefficient.The coefficient vector  is estimated by maximizing the partial log-likelihood , where  # denotes the observed covariate vector of the -th subject, and  # is a set of subjects who are at risk at time  # .The model implies the proportional hazard property, i.e.
Therefore, the statistical association between covariate  and the time to event can be fully studied through the coefficient  .Due to such straight-forward interpretation property, the Cox proportional hazard model is so far the most popular statistical model in survival analysis of the biomedical problems.The Cox proportional hazard model is also usually called Cox model. [32]

Survival regression model (SR)
Survival regression is a class of full parametric models in survival analysis.It directly models the relationship between survival time and covariates: log() =  +  T  + , ~ where  is a known distribution, such as standard extreme value distribution, logistic, normal etc.For example, if the  follows a standard extreme value distribution, it leads the Weibull regression model: log() =  +  T  + , ~    This model is equivalent to a proportional hazards model for  with a Weibull baseline hazard, that is, ℎ(|) =  wc: exp( T ), with  = 1/,  = exp (−/), and  | = − | /,  = 1,2, ⋯ , .

Boosted Cox regression (CoxBoost)
Boosting is an iterative machine learning technique that can be applied in various of statistical problems, including classification, regression and survival analysis.[33][34][35] A boosted Cox regression is a likelihood-based boosting algorithm in the Cox model.The parameters are iteratively estimated through maximizing the following function , where ̂=  € T  is the offset term that links the iteratively estimated parameters.[36] Random survival forests (RSF) Random survival forests (RSF) is an ensemble tree method for right-censored survival data analysis.[20] It extends the random forests, in which a collection of decision trees are built for classification and regression, [37] to time to event setting.
To adapt the idea of random forests, RSF uses the ensemble mortality as the predicted outcome which comprises both the survival time and censoring information.The expected outcome is defined as the sum of estimated cumulative hazard function (CHF) over the observed time (both censored and uncensored).For a subject , the expected mortality can be expressed as • where  # is the covariate vector of subject ,  | is the observed time for subject , and ( | | # ) denotes the value of CHF at time  | , conditioning on  # .[20]Then, the ensemble mortality for subject  is defined as where  Ž ( | | # ) is the ensemble CHF at time for  | subject  with covariates  # , and  Ž 8 | • # = ∈ [0, ∞) .Therefore, estimating ensemble CHF has been the key step in the RSF algorithm.Specifically, the ensemble CHF at time  is estimated as where  denotes the total number of trees;  ' " (| # ) is the estimated CHF in -th tree.Given an individual survival tree and  # , subject  will fall into a unique terminal node of the tree.For the -th tree, assume subject  falls into the terminal node, ℎ.Then, for the -th tree, the CHF can be estimated with various estimators, including the Nelson-Aalen estimator which can be expressed as where  •,-and  •,-are the number of deaths and number of subjects at risk at time  •,-in the terminal node ℎ, respectively.
Following the random forests algorithm, the RSF algorithm can be described as 1) Random generate a boostrap training samples from the original data set.
2) Grow survival tree for the boostrap samples.At each node, random select  variables to split the node into two daughter nodes with the aim of maximizing the survival difference between the daughter nodes.The tree is built until the terminal node has no less than  >0 deaths.3) Calculate CHF for each tree 4) Repeat 1)-3)  times 5) Calculate the estimated ensemble CHF based on  individual trees.

Oblique random survival forests(ORSF)
Oblique random survival forests (ORSF) is also an ensemble tree-based method for right-censor survival analysis.In each fold of the crossvalidation procedure, one fold of the data is left out for performance evaluation; and the rest of  − 1 folds are used for training (i.e., parameter estimation with some optimization algorithm).The model performance is evaluated by the averaged value of  performance measures.Based on the -fold cross-validation, the performance of different types of models can be compared and model selection processes can be conducted.
However, machine learning models commonly comprise one or more hyper-parameters which allow the model to adaptively fit the specific patterns given different data sets.For example, the hyper-parameters of random survival forests (RSF) are number of trees, node size, split rules, number of variables that randomly selected for node splitting, et. al..[20] Usually, these hyperparameters are tuned by grid-search or random search with cross-validation.A nested crossvalidation is an approach to conduct hyper-parameters tuning and model selection simultaneously.It nests the cross-validation for hyper-parameters tuning inside the cross-validation for model selection.

Hyperparameter tuning
Given a list of hyperparameters from each model, the optimal set of hyperparameters is tuned with 5 repetitions of 3×3 folds nested cross-validation.The optimal set of hyperparameters is the set that reaches largest mean C-index of inner training and testing datasets.The final selected hyperparameters are shown in the Table S2.
[21]  It is an extension of random survival forests (RSF).[20]It generalizes RSF by using linear combination of variables to recursively partition the training data.Cox proportional hazards deep neural network (DeepSurv) DeepSurv is a deep learning method of survival analysis.It generalizes the Cox-PH model by modeling the hazard function as ℎ(|) = ℎ E ()  ™(`) where () is output of a deep feed-forward neural network.[38] S2.Supplemental Model Calibration and Hyperparameter Tuning Nested cross-validation Cross-validation is a widely employed technique for both model selection and performance evaluation.[39]Instead of training the model only once on the training data set which may generate over-fitting problem, to reduce the generalization error, cross-validation provides a simple way to reduce the bias by sampling.A -fold cross-validation randomly split the training data set into  non-overlapping folds; the model is then repeatedly trained  times.